98 research outputs found

    Univariate decision tree induction using maximum margin classification

    In many pattern recognition applications, decision trees are often the first choice due to their simplicity and easily interpretable nature. In this paper, we propose a new decision tree learning algorithm called the univariate margin tree, where, for each continuous attribute, the best split is found using convex optimization. Our simulation results on 47 data sets show that the novel margin tree classifier performs at least as well as C4.5 and the linear discriminant tree (LDT) with a similar time complexity. For two-class data sets, it generates significantly smaller trees than C4.5 and LDT without sacrificing accuracy, and for multiclass data sets it generates significantly more accurate trees than C4.5 and LDT with a one-vs-rest methodology.
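
The split search can be illustrated with a toy sketch. Rather than the paper's convex-optimization formulation, the snippet below scans candidate thresholds for one continuous attribute and, among equally accurate splits, prefers the one with the widest gap (margin) around the threshold. The function name and the tie-breaking rule are illustrative assumptions, not the authors' algorithm.

```python
def margin_split(values, labels):
    """Toy margin-aware split search for one continuous attribute.

    Scans midpoints between consecutive sorted values, counts the
    misclassifications of a majority-vote split, and breaks ties by
    the widest gap around the threshold (the "margin").
    """
    pairs = sorted(zip(values, labels))
    best = None  # (errors, -gap, threshold); smaller tuple is better
    for i in range(len(pairs) - 1):
        lo, hi = pairs[i][0], pairs[i + 1][0]
        if lo == hi:
            continue  # no threshold fits between equal values
        t = (lo + hi) / 2.0
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        # errors if each side predicts its own majority class
        errors = (len(left) - max(left.count(c) for c in set(left))) + \
                 (len(right) - max(right.count(c) for c in set(right)))
        cand = (errors, -(hi - lo), t)
        if best is None or cand < best:
            best = cand
    return best[2]
```

For two well-separated clusters, the threshold lands in the middle of the widest class-separating gap, e.g. `margin_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])` yields `6.5`.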

    VC-dimension of univariate decision trees

    PubMed ID: 25594983
    In this paper, we give and prove lower bounds on the Vapnik-Chervonenkis (VC) dimension of the univariate decision tree hypothesis class. The VC-dimension of a univariate decision tree depends on the VC-dimension values of its subtrees and on the number of inputs. Via a search algorithm that calculates the VC-dimension of univariate decision trees exhaustively, we show that our VC-dimension bounds are tight for simple trees. To verify that the VC-dimension bounds are useful, we also use them to obtain VC generalization bounds for complexity control using structural risk minimization in decision trees, i.e., pruning. Our simulation results show that structural risk minimization pruning using the VC-dimension bounds finds trees that are more accurate than those pruned using cross-validation.
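
The SRM pruning idea can be sketched with a Vapnik-style generalization bound: training error plus a complexity penalty that grows with the VC-dimension and shrinks with the sample size. The exact bound and constants used in the paper may differ; this is an assumption-laden illustration.

```python
import math

def vc_bound(train_error, h, n, eta=0.05):
    """Vapnik-style generalization bound, used here as an SRM score.

    h: VC-dimension of the hypothesis class, n: sample size,
    eta: confidence parameter. Larger h inflates the penalty.
    """
    penalty = math.sqrt((h * (math.log(2.0 * n / h) + 1.0)
                         - math.log(eta / 4.0)) / n)
    return train_error + penalty
```

In SRM pruning one would replace a subtree by a leaf whenever the leaf's bound is no worse than the subtree's, trading a small rise in training error for a large drop in the complexity term.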

    Quadratic programming for class ordering in rule induction

    Separate-and-conquer rule induction algorithms such as Ripper solve a K > 2 class problem by converting it into a sequence of K - 1 two-class problems. As a usual heuristic, the classes are fed into the algorithm in order of increasing prior probability. Although the heuristic works well in practice, there is much room for improvement. In this paper, we propose a novel approach that improves on this heuristic: it transforms the ordering search problem into a quadratic optimization problem and uses the solution of that optimization problem to extract the optimal ordering. We compared the new Ripper (guided by the ordering found with our approach) with the original Ripper (guided by the heuristic ordering) on 27 datasets. Simulation results show that our approach produces rulesets that are significantly better than those produced by the original Ripper.
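
The baseline heuristic that the paper improves on is simple enough to sketch: feed classes to the separate-and-conquer learner in order of increasing prior probability, so the most frequent class is left as the default rule. (The paper's quadratic-programming formulation is not reproduced here.)

```python
from collections import Counter

def class_order_by_prior(labels):
    """Ripper-style heuristic ordering: classes sorted by
    increasing prior probability (i.e., by frequency in the data)."""
    counts = Counter(labels)
    return sorted(counts, key=lambda c: counts[c])
```

For example, with 5 instances of class "a", 2 of "b", and 3 of "c", the learner would handle "b" first and leave "a" as the default class.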

    Model selection in omnivariate decision trees using Structural Risk Minimization

    As opposed to trees that use a single type of decision node, an omnivariate decision tree contains nodes of different types. We propose to use Structural Risk Minimization (SRM) to choose between node types in omnivariate decision tree construction, matching the complexity of a node to the complexity of the data reaching that node. In order to apply SRM for model selection, one needs the VC-dimension of the candidate models. In this paper, we first derive the VC-dimension of the univariate model and estimate the VC-dimensions of all three models (univariate, linear multivariate, and quadratic multivariate) experimentally. Second, we compare SRM with other model selection techniques, including Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC), and cross-validation (CV), on standard datasets from the UCI and Delve repositories. We see that SRM induces omnivariate trees with a small percentage of multivariate nodes close to the root, and these trees generalize at least as accurately as those constructed using other model selection techniques.
    The authors thank the three anonymous referees and the editor for their constructive comments, pointers to related literature, and pertinent questions, which allowed us to better situate our work as well as organize the manuscript and improve the presentation. This work has been supported by the Turkish Scientific Technical Research Council (TUBITAK) EEEAG 107E127.
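
Per-node model selection can be sketched as picking the candidate node type with the smallest penalized training error. The Vapnik-style bound below is an assumption standing in for the paper's exact SRM expression; the candidate names are illustrative.

```python
import math

def srm_select(candidates, n, eta=0.05):
    """Pick the node type with the smallest VC-penalized training error.

    candidates: list of (name, train_error, vc_dimension) for the models
    fit at one node; n: number of instances reaching that node.
    """
    def bound(err, h):
        return err + math.sqrt((h * (math.log(2.0 * n / h) + 1.0)
                                - math.log(eta / 4.0)) / n)
    return min(candidates, key=lambda c: bound(c[1], c[2]))[0]
```

With few instances at a node, a slightly less accurate but far simpler model wins: a univariate split with error 0.10 and VC-dimension 5 beats a quadratic multivariate split with error 0.02 but VC-dimension 200.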

    On the feature extraction in discrete space

    In many pattern recognition applications, feature space expansion is a key step for improving the performance of the classifier. In this paper, we (i) expand the discrete feature space by exhaustively generating all orderings of the values of k discrete attributes, and (ii) modify the well-known decision tree and rule induction classifiers (ID3, Quinlan, 1986 [1] and Ripper, Cohen, 1995 [2]) to use these orderings as the new attributes. Our simulation results on 15 datasets from the UCI repository [3] show that the novel classifiers perform better than the original ones in terms of error rate and complexity. This work has been supported by the Turkish Scientific Technical Research Council (TUBITAK) EEEAG 107E127.
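
The exhaustive-ordering expansion can be sketched with `itertools.permutations`: each total ordering of a discrete attribute's value set can serve as one derived ordinal attribute. The function name is illustrative, and how the orderings are then encoded as features is an assumption not detailed here.

```python
from itertools import permutations

def value_orderings(values):
    """Enumerate every total ordering of a discrete attribute's values.

    Each returned ordering can be treated as one new candidate ordinal
    attribute; for m distinct values this yields m! orderings.
    """
    return [list(p) for p in permutations(sorted(values))]
```

For a three-valued attribute this produces 3! = 6 orderings, which is why the expansion is only practical for small value sets.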

    Omnivariate rule induction using a novel pairwise statistical test

    Rule learning algorithms such as RIPPER induce univariate rules; that is, a propositional condition in a rule uses only one feature. In this paper, we propose an omnivariate induction of rules where, for each condition, both a univariate and a multivariate condition are trained, and the better one is chosen according to a novel statistical test. This paper has three main contributions. First, we propose a novel statistical test, the combined 5 x 2 cv t test, to compare two classifiers; it is a variant of the 5 x 2 cv t test, and we give its connections to other tests such as the 5 x 2 cv F test and the k-fold paired t test. Second, we propose a multivariate version of RIPPER, where a support vector machine with a linear kernel is used to find multivariate linear conditions. Third, we propose an omnivariate version of RIPPER, where model selection is done via the combined 5 x 2 cv t test. Our results indicate that 1) the combined 5 x 2 cv t test has higher power (lower type II error), lower type I error, and higher replicability compared with the 5 x 2 cv t test, and 2) omnivariate rules are better in that they choose whichever condition is more accurate, selecting the right model automatically and separately for each condition in a rule.
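
The base statistic that the combined test varies is Dietterich's 5x2cv paired t test, which can be sketched directly from its definition (the combined variant itself is not reproduced here, since its exact form is the paper's contribution).

```python
import math

def five_by_two_cv_t(diffs):
    """Dietterich's 5x2cv paired t statistic.

    diffs: five pairs (p_i1, p_i2) of error-rate differences between the
    two classifiers, one pair per replication of 2-fold cross-validation.
    Returns p_11 / sqrt(mean of the five per-replication variances).
    """
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2.0
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    return diffs[0][0] / math.sqrt(sum(variances) / 5.0)
```

Under the null hypothesis the statistic approximately follows a t distribution with 5 degrees of freedom, so values far from zero indicate a real accuracy difference.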

    Mapping classifiers and datasets

    Given the posterior probability estimates of 14 classifiers on 38 datasets, we plot two-dimensional maps of classifiers and datasets using principal component analysis (PCA) and Isomap. The similarity between classifiers indicates correlation (or diversity) between them and can be used in deciding whether to include both in an ensemble. Similarly, datasets that are too similar need not both be used in a general comparison experiment. The results show that (i) most of the datasets we used (approximately two-thirds) are similar to each other; (ii) multilayer perceptrons and k-nearest neighbor variants are more similar to each other than support vector machine and decision tree variants; and (iii) the number of classes and the sample size have an effect on similarity.
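
Before any map is drawn, some dissimilarity between classifiers must be computed from their posterior estimates over the same instances. The Euclidean distance used below is an assumption for illustration; the paper's exact metric may differ.

```python
import math

def classifier_distance(posteriors_a, posteriors_b):
    """Dissimilarity of two classifiers: Euclidean distance between
    their posterior-probability estimates on the same instances.
    Small distance suggests correlated (non-diverse) classifiers."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(posteriors_a, posteriors_b)))
```

A matrix of such pairwise distances is exactly what PCA or Isomap would then project to two dimensions to produce the classifier map.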

    Automatic propbank generation for Turkish

    Semantic role labeling (SRL) is an important task for understanding natural languages, where the objective is to analyse the propositions expressed by the verb and to identify each word that bears a semantic role. It provides an extensive dataset that can enhance NLP applications such as information retrieval, machine translation, information extraction, and question answering. However, creating SRL models is difficult; for some languages, it is even infeasible to create SRL models with predicate-argument structure due to the lack of linguistic resources. In this paper, we present our method for creating an automatic Turkish PropBank by exploiting parallel data from the translated sentences of the English PropBank. Experiments show that our method gives promising results. © 2019 Association for Computational Linguistics (ACL).

    Software defect prediction using Bayesian networks

    Many different software metrics have been discovered and used for defect prediction in the literature. Instead of dealing with so many metrics, it would be practical and easy if we could determine the set of metrics that are most important and focus on them to predict defectiveness. We use Bayesian networks to determine the probabilistic influential relationships among software metrics and defect proneness. In addition to the metrics in the Promise data repository, we define two more metrics: NOD, for the number of developers, and LOCQ, for the source code quality. We extract these metrics by inspecting the source code repositories of the selected Promise data sets. At the end of our modeling, we learn the marginal defect proneness probability of the whole software system, the set of most effective metrics, and the influential relationships among metrics and defectiveness. Our experiments on nine open source Promise data sets show that response for class (RFC), lines of code (LOC), and lack of coding quality (LOCQ) are the most effective metrics, whereas coupling between objects (CBO), weighted methods per class (WMC), and lack of cohesion of methods (LCOM) are less effective on defect proneness. Furthermore, number of children (NOC) and depth of inheritance tree (DIT) have very limited effect and are untrustworthy. On the other hand, based on the experiments on the Poi, Tomcat, and Xalan data sets, we observe a positive correlation between the number of developers (NOD) and the level of defectiveness. However, further investigation involving a greater number of projects is needed to confirm our findings.
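
Once a Bayesian network over the metrics is learned, predicting defect proneness reduces to querying conditional probability tables given metric evidence. The sketch below shows only that lookup step; the table structure and the probability values are hypothetical illustrations, not numbers learned from the Promise data sets.

```python
def defect_probability(cpt, evidence):
    """Look up P(defective | metric levels) in a conditional
    probability table keyed by (RFC, LOC, LOCQ) discretized levels.
    The metric set and discretization are illustrative assumptions."""
    key = tuple(evidence[m] for m in ("RFC", "LOC", "LOCQ"))
    return cpt[key]

# Hypothetical CPT: high RFC/LOC with low code quality -> high defect risk
cpt = {
    ("high", "high", "low"): 0.7,
    ("low", "low", "high"): 0.1,
}
```

In a full Bayesian network the table for each node would be estimated from data, and inference would marginalize over unobserved metrics rather than require all three as evidence.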

    Construction of a Turkish WordNet and its use in machine translation

    Since the creation of the Princeton WordNet, wordnets for many other languages have been created. KeNet, the most comprehensive wordnet for Turkish, currently includes 80,000 synsets, which are linked through semantic as well as interlingual relations. KeNet has also been used in various NLP studies on Turkish, including the Turkish PropBank, domain-specific wordnets, and semantic annotation. KeNet is accessible through two platforms. The first is the web page “http://haydut.isikun.edu.tr/wordnet.ui-1.0/”, through which one can search for words or lemmas and, from the search results, view the hierarchy and semantic relations of the found words. We also serve KeNet through GitHub, from which all of KeNet's files can be downloaded and KeNet can be used in one of three programming languages: Java, C++, and Python. The repositories are “https://github.com/olcaytaner/TurkishWordNet”, “https://github.com/olcaytaner/TurkishWordNet-CPP”, and “https://github.com/olcaytaner/TurkishWordNet-Py” for the Java, C++, and Python languages, respectively. In this report, we present the procedure we adopted in creating KeNet, with reference to the stages we followed in its construction.